[SPARK-47296][SQL][COLLATION] Fail unsupported functions for non-binary collations #45422
Conversation
Resolved review threads (outdated):
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/CollationUtils.scala (4 threads)
- ...lyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/CollationTypeConstraints.scala (2 threads)
> Without updating …

@cloud-fan yes, that is a problem... On a more important note: even if we were to update type coercion, that is a separate effort and will probably cover other parts of the codebase. What do we think about implementing this for now? @dbatomic

I don't think it's safe to only handle expressions in … I'd prefer only updating functions that support collation to have a more fine-grained collation check, which shouldn't be many right now.

@cloud-fan that makes a lot of sense; the new case classes should now handle this.
Resolved review threads (outdated):
- sql/api/src/main/scala/org/apache/spark/sql/types/StringType.scala
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
mostly LGTM
@@ -40,6 +40,7 @@ class StringType private(val collationId: Int) extends AtomicType with Serializa
 * equality and hashing).
 */
def isBinaryCollation: Boolean = CollationFactory.fetchCollation(collationId).isBinaryCollation
def isLowercaseCollation: Boolean = collationId == CollationFactory.LOWERCASE_COLLATION_ID
Can you remove even this guy and push the check into StringTypeBinaryLcase?
I am not sure this is possible. StringTypeBinaryLcase does not extend StringType, and the point of this function for now is to call it on a StringType object in acceptsType to check whether we should let the function proceed with that input.
LGTM. As a follow-up we should revisit error messages. IMO it is weird to expose a message with the "string_any_collation" type to customers, but I think we can do that as a follow-up.
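The acceptsType check discussed above can be sketched with simplified stand-ins for the real Spark classes (the collation ids and the acceptsType signature here are illustrative assumptions, not Spark's actual definitions):

```scala
// Minimal stand-in model: a StringType carries a collationId, and the abstract
// expected type StringTypeBinaryLcase accepts only binary or lowercase collations.
object CollationCheckSketch {
  val BINARY_COLLATION_ID = 0
  val LOWERCASE_COLLATION_ID = 1 // hypothetical id for the lowercase collation

  final case class StringType(collationId: Int) {
    def isBinaryCollation: Boolean = collationId == BINARY_COLLATION_ID
    def isLowercaseCollation: Boolean = collationId == LOWERCASE_COLLATION_ID
  }

  // StringTypeBinaryLcase does not extend StringType, so acceptsType has to
  // inspect the concrete StringType instance it receives.
  object StringTypeBinaryLcase {
    def acceptsType(other: Any): Boolean = other match {
      case st: StringType => st.isBinaryCollation || st.isLowercaseCollation
      case _ => false
    }
  }
}
```

This mirrors the point made in the reply: the check lives on the abstract expected type, but it still inspects the concrete StringType it is handed.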
// Cast any atomic type to string.
case (any: AtomicType, StringType) if any != StringType => StringType
case (any: AtomicType, _: StringTypeCollated) if any != StringType => StringType
The code can be more readable if we call StringTypeCollated#defaultConcreteType, which is StringType(0).
Is this what you meant?
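The defaultConcreteType suggestion above can be sketched as follows; these are simplified stand-ins, assuming StringTypeCollated is an abstract expected type whose default concrete type is the binary-collated StringType(0):

```scala
// Stand-in model: StringTypeCollated resolves to StringType(0) as its default
// concrete type, so coercion rules can produce a concrete type from it.
object DefaultConcreteTypeSketch {
  final case class StringType(collationId: Int)

  trait StringTypeCollated {
    def defaultConcreteType: StringType = StringType(0)
  }
  case object StringTypeAnyCollation extends StringTypeCollated
}
```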
@@ -205,6 +205,11 @@ object AnsiTypeCoercion extends TypeCoercionBase {
case (StringType, AnyTimestampType) =>
  Some(AnyTimestampType.defaultConcreteType)

// If a function expects a StringType, no StringType instance should be implicitly cast to
// StringType with a collation that's not accepted (aka. lockdown unsupported collations).
case (StringType, StringType) => None
Isn't this case match covered by the first case match, case _ if expectedType.acceptsType(inType) => Some(inType)?
I think this should be case (_: StringType, StringType) ...
@@ -215,6 +220,10 @@ object AnsiTypeCoercion extends TypeCoercionBase {
  None
}

// "canANSIStoreAssign" doesn't account for targets extending StringTypeCollated, but
// ANSIStoreAssign is generally expected to return "true" for (AtomicType, StringType)
case (_: AtomicType, _: StringTypeCollated) => Some(StringType)
Is this correct? StringType does not satisfy StringTypeCollated.
Yes it is. canANSIStoreAssign has a rule for casting AtomicType to StringType, but since StringTypeCollated extends only AbstractDataType and not StringType, that cast rule will not be picked up. That said, this rule should be improved to check all canANSIStoreAssign rules.
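The hierarchy point being made here can be sketched with simplified stand-ins (IntegerType and the implicitCast signature are illustrative, not Spark's real definitions): because StringTypeCollated extends only AbstractDataType, a rule matching on a StringType target never fires for it, so the explicit (AtomicType, StringTypeCollated) case is needed.

```scala
// Stand-in type hierarchy: StringTypeCollated is NOT a StringType, so the
// store-assign style rule for StringType targets cannot match it.
object CoercionSketch {
  sealed trait AbstractDataType
  trait AtomicType extends AbstractDataType
  case object IntegerType extends AtomicType
  case object StringType extends AtomicType
  trait StringTypeCollated extends AbstractDataType
  case object StringTypeAnyCollation extends StringTypeCollated

  def implicitCast(in: AbstractDataType, expected: AbstractDataType): Option[AbstractDataType] =
    (in, expected) match {
      // Without this explicit case, no rule would fire for StringTypeCollated targets.
      case (any: AtomicType, _: StringTypeCollated) if any != StringType => Some(StringType)
      case _ => None
    }
}
```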
@@ -994,8 +994,12 @@ object TypeCoercion extends TypeCoercionBase {
case (StringType, datetime: DatetimeType) => datetime
case (StringType, AnyTimestampType) => AnyTimestampType.defaultConcreteType
case (StringType, BinaryType) => BinaryType
case (st: StringType, StringType) => st
Reading the code around here, I think null means no implicit cast.
// If a function expects a StringType, no StringType instance should be implicitly cast to
// StringType with a collation that's not accepted (aka. lockdown unsupported collations).
case (StringType, StringType) => None
case (StringType, _: StringTypeCollated) => None
This case should be put before case (StringType, a: AtomicType) =>, otherwise it's useless.
@@ -994,8 +994,12 @@ object TypeCoercion extends TypeCoercionBase {
case (StringType, datetime: DatetimeType) => datetime
case (StringType, AnyTimestampType) => AnyTimestampType.defaultConcreteType
case (StringType, BinaryType) => BinaryType
case (st: StringType, StringType) => st
case (st: StringType, _: StringTypeCollated) => st
I think this will be covered by the last default case match?
Yes and no. The following two lines would have made a cast, but I changed them so they don't.
Resolved review threads:
- sql/core/src/test/scala/org/apache/spark/sql/CollationRegexpExpressionsSuite.scala
- sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala
Could this be an onboarding task?

thanks, merging to master!
…ry collations
Closes apache#45422 from uros-db/regexp-functions.
Lead-authored-by: Uros Bojanic <157381213+uros-db@users.noreply.github.com>
Co-authored-by: Mihailo Milosevic <mihailo.milosevic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…ypeCollated

### What changes were proposed in this pull request?
Renaming simpleString in StringTypeAnyCollation. This PR should only be merged after #45383 is merged.

### Why are the changes needed?
[SPARK-47296](#45422) introduced a change to fail all unsupported functions. Because of this change, the expected inputTypes in ExpectsInputTypes had to be changed. This introduced a user-facing change that prints "STRING_ANY_COLLATION" in places where we previously printed "STRING" when an error occurred. Concretely, if we get an Int input where StringTypeAnyCollation was expected, we throw this faulty message for users.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
Existing tests were changed back to the "STRING" notation.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #45694 from mihailom-db/SPARK-47504.
Authored-by: Mihailo Milosevic <mihailo.milosevic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?

Why are the changes needed?

Currently, all StringType arguments passed to built-in string functions in Spark SQL get treated as binary strings. This behaviour is incorrect for almost all collationIds except the default (0), and we should instead warn the user if they try to use an unsupported collation for the given function. Over time, we should implement the appropriate support for these (function, collation) pairs, but until then we should have a way to fail unsupported statements in query analysis.

Does this PR introduce any user-facing change?

Yes, users will now get appropriate errors when they try to use an unsupported collation with a given string function.

How was this patch tested?

Tests in CollationSuite to check if these functions work for binary collations and throw exceptions for others.

Was this patch authored or co-authored using generative AI tooling?

Yes.
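The fail-fast behaviour described in the PR body can be sketched as follows; this is a hedged, simplified stand-in (TypeCheckResult and checkBinaryCollationOnly are illustrative names, not Spark's actual API): a string function that only supports binary collations rejects any other collation at analysis time instead of silently comparing binary.

```scala
// Stand-in for the analysis-time collation check: binary collation (id 0)
// passes, any other collation produces a type-check failure.
object FailUnsupportedSketch {
  final case class StringType(collationId: Int) {
    def isBinaryCollation: Boolean = collationId == 0 // default/binary collation
  }

  sealed trait TypeCheckResult
  case object TypeCheckSuccess extends TypeCheckResult
  final case class TypeCheckFailure(message: String) extends TypeCheckResult

  def checkBinaryCollationOnly(input: StringType, function: String): TypeCheckResult =
    if (input.isBinaryCollation) TypeCheckSuccess
    else TypeCheckFailure(
      s"function $function does not support collation id ${input.collationId}")
}
```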